Analysis of OMICS data, practical 8

Transcriptomics part 2: Differential expression analysis

Eszter Ari

Eötvös Loránd University, Budapest & Biological Research Centre, Szeged

April 4, 2025

Work-flow of RNA-seq data analysis

  1. Extract expressed RNA, sequencing → fastq file

  2. Pre-mapping quality checking, trimming (filtering)

  3. Read mapping to reference genome OR de novo assembly of transcripts

    • Post mapping quality checking
  4. Read counting

  5. Quantitative or Differential Expression Analyses: comparing expression levels

  6. Functional enrichment analysis: GO, pathways…

Normalization

  • It is not possible to do absolute quantification using the standard RNA-Seq pipeline
    • because it only provides RNA levels relative to all transcripts
  • The counts need to be adjusted to be comparable across samples and experiments
    • because the total coverage (sum of all read counts) differs across samples
    • Relative vs. absolute expressions

Sampling reads from a population of DNA fragments is multinomial

For a single gene, it is a coin toss, i.e. binomial

\(Y_{i}\sim Binomial\left ( M,\lambda_{i} \right )\)

\(Y_{i} =\) the observed number of reads for gene \(i\)

\(M =\) the total number of reads (library size)

\(\lambda_{i} =\) proportion (relative abundance) of gene \(i\) reads (lambda)

Large \(M\), small \(\lambda_{i}\) → approximated well by Poisson \(\left ( \mu _{i} = M\cdot \lambda _{i} \right )\) (mu)

Technical replication vs. biological replication

Technical replication vs. biological replication

Mean-variance plots

What we see in real data

Model assumptions

\(M =\) the total number of reads (library size)

\(\lambda_{i} =\) proportion (relative abundance) of gene \(i\) reads (lambda)

Poisson describes technical variation:

\[ Y_{i}\sim Poisson\left ( M\cdot \lambda_{i} \right ) \] \[ mean\left ( Y_{i} \right ) = variance \left ( Y_{i} \right ) = M\cdot \lambda_{i} \]

Negative binomial describes biological variation:

\(\varphi =\) dispersion parameter (phi)

\[ Y_{i}\sim NB\left ( \mu_{i} = M\cdot \lambda_{i}, \varphi_{i} \right ) \] Same mean, but variance is larger (quardatic):

\[ variance\left ( Y_{i} \right ) = \mu_{i} \left ( 1 + \mu_{i} \varphi_{i} \right ) \]

Critical paraemeter to estimate: \(\varphi_{i} =\) dispersion

Reasons behind differences in read numbers

  • Library composition (i.e. relative size of the studied transcriptome) can be different in two different biological conditions.
  • Library size (i.e. sequencing depth) varies between samples coming from different lanes of the flow cell of the sequencing machine.
  • Gene length: Longer genes will have higher number of reads.
  • GC content biases across different genes and samples may lead to a biased sampling.
  • Read coverage of a transcript can be biased and non-uniformly distributed along the transcript.

Basic normalization approaches

Addressing library size biases

  • CPM: Counts Per Million
    • Normalizes the counts by the total number of reads in the library.
    • \(CPM = \frac{\text{Nr. of read counts of the gene × } 10^{6}}{\text{Total nr. of counted reads}}\)

Addressing library size and gene length biases

  • TPM: Transcripts Per Million
  • RPKM: Reads Per Kilobase of transcript per Million mapped reads
    • Normalizes by the the length of the gene also.
    • \(RPKM = \frac{\text{Nr. of read counts of the gene × } 10^{3} × 10^{6}}{\text{Total nr. of counted reads × Gene length}}\)

More advanced normalization approaches

  • DESeq2’s median of ratios:
    • Normalizes the counts by the size factors of the samples.
    • Size factor is calculated from the median of ratios of each gene’s counts to the geometric mean of all genes.
  • edgeR’s TMM: Trimmed Mean of M-values
    • Normalizes the counts by the total number of reads in the library and the length of the gene.

Experimental layout

2 groups:

  • Question: Which genes do express significantly differently between the two groups? → p-value
  • The direction and magnitude of the difference → Fold change
  • pairwise DE analysis

Experimental layout

More treatment types, more groups:

  • Question: Does a factor (treatment type) couse DE? In which genes?
    • Factors: e.g., obese or not; male or female …
  • We use a Generalised Linear Model (GLM) to calculate the p-value
  • The direction of the difference → Fold change

GLM

GLM

GLM

Visualization of DE results

MA plot: transforming the data onto \(M\) (log ratio) and \(A\) (mean average) scales

Visualization of DE results

Volcano plot

Multiple testing correction

  • We calculate the same statistical test several times → all genes we investigated
  • If we use \(p = 0.05\) as a threshold: we have a \(5\%\) chance to accept something significantly differently expressed when the expressions were not different.
  • False discovery rate (FDR) correction based on all p-values, i.e. Bonferroni or Benjamini-Hochberg correction